CRFsuite Tutorial

http://www.chokkan.org/software/crfsuite/tutorial.html

Task description

In this example, NP stands for a noun phrase, VP for a verb phrase, and PP for a prepositional phrase.

sequential labeling task (sequence labeling)

IOB2 notation (IOB): B- marks the first token of a chunk, I- a token inside a chunk, and O a token outside any chunk.

The goal of this tutorial is to build a model that predicts chunk labels for a given sentence (sequence of tokens) by using CRFsuite.
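To make the IOB2 labels concrete, here is a minimal sketch that groups a label sequence back into chunks. The function name `iob2_to_chunks` and the sample label list are made up for illustration; this is not part of CRFsuite.

```python
def iob2_to_chunks(labels):
    """Group IOB2 labels into (chunk_type, start, end) spans; end is exclusive."""
    chunks = []
    start, kind = None, None
    for i, label in enumerate(labels):
        # An I- label continues the open chunk only if the chunk type matches.
        continues = label.startswith("I-") and kind == label[2:]
        if not continues:
            # Anything else (O, B-, or a mismatched I-) closes the open chunk.
            if kind is not None:
                chunks.append((kind, start, i))
            start, kind = None, None
        if label.startswith("B-") or (label.startswith("I-") and not continues):
            # B- opens a new chunk; a stray I- is tolerated and treated the same way.
            start, kind = i, label[2:]
    if kind is not None:
        chunks.append((kind, start, len(labels)))
    return chunks


# Illustrative labels only (not taken from the corpus).
print(iob2_to_chunks(["B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "O"]))
```

Each tuple gives the chunk type and the token range it covers, so ("NP", 0, 2) means tokens 0 and 1 form one noun phrase.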

Training and testing data

CoNLL 2000 shared task

Example: London JJ B-NP

The data consists of a set of sentences (sequences) each of which contains a series of words (e.g., 'London', 'shares'), part-of-speech tags (e.g., 'JJ', 'NNS'), and chunk labels (e.g., 'B-NP', 'I-NP') separated by space characters.
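The format described above can be read with a few lines of Python. This is a rough sketch, assuming one token per line and a blank line between sentences; `read_sentences` and `read_gzipped` are hypothetical helper names, not functions shipped with CRFsuite.

```python
import gzip


def read_sentences(lines, separator=" "):
    """Yield each sentence as a list of (word, pos, chunk) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # A blank line ends the current sentence (assumed convention).
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos, chunk = line.split(separator)
        sentence.append((word, pos, chunk))
    if sentence:
        yield sentence


def read_gzipped(path):
    """Read sentences straight from a gzip-compressed file such as train.txt.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from read_sentences(f)


sample = ["London JJ B-NP\n", "shares NNS I-NP\n", "\n"]
for sentence in read_sentences(sample):
    print(sentence)
```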

The scripts needed for this tutorial are included under the example directory in the CRFsuite distribution.

(So less can show the contents of a txt.gz file.)

In this tutorial, we would like to construct a CRF model that assigns a sequence of chunk labels, given a sequence of words and part-of-speech codes.

In other words: the input is a sequence of words and part-of-speech codes, and the output is a sequence of chunk labels.

Feature (attribute) generation

https://github.com/chokkan/crfsuite/blob/0.12/example/chunking.py

Converts {train,test}.txt.gz into {train,test}.crfsuite.gz

The features are tab-separated

In general, this is the most important process for machine-learning approaches because feature design greatly affects the labeling accuracy.

(This predates deep learning, so feature engineering carries a lot of weight.)

In this tutorial, we extract 19 kinds of attributes from a word at position t (in offsets from the beginning of a sequence):

The word and the two words on either side (w[t-2], w[t-1], w[t], w[t+1], w[t+2])

Word bigrams (w[t-1]|w[t], w[t]|w[t+1])

The part-of-speech tags of the same five positions (pos[t-2], pos[t-1], pos[t], pos[t+1], pos[t+2])

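Part-of-speech sequences

POS bigrams, which take one adjacent word's tag into account (pos[t-2]|pos[t-1], pos[t-1]|pos[t], pos[t]|pos[t+1], pos[t+1]|pos[t+2])

POS trigrams, which take two adjacent words' tags into account (pos[t-2]|pos[t-1]|pos[t], pos[t-1]|pos[t]|pos[t+1], pos[t]|pos[t+1]|pos[t+2])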

(Are these counted up over the entire training data, i.e. the corpus?)

CRFsuite will learn associations between these attributes (e.g., "pos[0]|pos[1]|pos[2]=DT|JJ|NN") and labels (e.g., "B-NP") to predict a label sequence for a given text.

The "name=value" convention is merely a convenience for interpreting attribute names.

CRFsuite accepts any string as an attribute name as long as the string does not contain a colon character.
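As a small illustration of the tab-separated layout and the colon restriction, the sketch below formats one item as a label followed by "name=value" attributes. The helper names and the `__COLON__` replacement token are assumptions made for this example; any colon-free token would serve.

```python
def escape(text, replacement="__COLON__"):
    """Replace colons, which may not appear in an attribute name."""
    return text.replace(":", replacement)


def format_item(label, attributes):
    """Build one data line: the label, then tab-separated name=value attributes.

    attributes is a list of (name, value) pairs.
    """
    fields = [label]
    for name, value in attributes:
        fields.append(escape("%s=%s" % (name, value)))
    return "\t".join(fields)


print(format_item("B-NP", [("w[0]", "London"), ("pos[0]", "JJ")]))
```

Because the attribute is just an opaque string to CRFsuite, the "=" carries no special meaning; only the colon needs attention.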

Training

crfsuite learn

This produces the model file specified with -m.

You can also train a CRF model, watching its performance (accuracy, precision, recall, f1 score) evaluated on the test data.

crfsuite learn -e2
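For reference, the metrics named above can be computed from gold and predicted labels as in the sketch below. It is a token-level calculation written for these notes (the function name `evaluate` is hypothetical) and is not claimed to reproduce CRFsuite's own report line for line.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, {label: (precision, recall, f1)}) at the token level."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        n_gold[g] += 1
        n_pred[p] += 1
        if g == p:
            correct[g] += 1
    accuracy = sum(correct.values()) / len(gold) if gold else 0.0
    per_label = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        # precision: correct / predicted as this label; recall: correct / truly this label
        precision = correct[label] / n_pred[label] if n_pred[label] else 0.0
        recall = correct[label] / n_gold[label] if n_gold[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        per_label[label] = (precision, recall, f1)
    return accuracy, per_label


# Illustrative labels only.
accuracy, per_label = evaluate(["B-NP", "I-NP", "O", "B-NP"],
                               ["B-NP", "I-NP", "B-NP", "B-NP"])
print(accuracy, per_label)
```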

Tagging

crfsuite tag

Dumping the model file

crfsuite dump
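The three subcommands noted so far (learn, tag, dump) can also be driven from Python. The sketch below only builds the argument lists and runs them if a crfsuite executable is found on PATH; the helper names and all file names are placeholders chosen for this example, and only the options already mentioned in these notes (-m, -e2) are used.

```python
import shutil
import subprocess


def learn_command(train, model=None, holdout=None):
    """Arguments for `crfsuite learn`; holdout is an optional second data file."""
    argv = ["crfsuite", "learn"]
    if model is not None:
        argv += ["-m", model]      # write the trained model to this file
    if holdout is not None:
        argv.append("-e2")         # evaluate on the second data file while training
    argv.append(train)
    if holdout is not None:
        argv.append(holdout)
    return argv


def tag_command(model, data):
    """Arguments for `crfsuite tag` with a trained model."""
    return ["crfsuite", "tag", "-m", model, data]


def dump_command(model):
    """Arguments for `crfsuite dump`, which prints a model in readable form."""
    return ["crfsuite", "dump", model]


def run(argv):
    """Run a command and return its standard output as text."""
    if shutil.which(argv[0]) is None:
        raise RuntimeError("%s was not found on PATH" % argv[0])
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


# Placeholder file names; nothing is executed here.
print(learn_command("train.crfsuite.txt", model="chunking.model"))
print(tag_command("chunking.model", "test.crfsuite.txt"))
print(dump_command("chunking.model"))
```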

Notes on writing attribute extractors

A walkthrough of https://github.com/chokkan/crfsuite/blob/0.12/example/chunking.py (so that I can adapt it)

The common parts (attribute generation from templates, data I/O, etc.) are implemented in other modules, crfutils.py and template.py.

chunking.py only needs to define three variables

separator

the separator character(s) of the input data

In the example, a single (half-width) space

fields

the field name(s) of the input data, ordered from left to right and separated by a space character

Three columns: w, pos, y

London(=w) JJ(=pos) B-NP(=y)

templates

attribute (feature) templates written as a Python tuple/list object

Each template is a tuple/list of (name, offset) pairs, in which name represents a field name and offset represents an offset relative to the current position.

(('w', -2), ), corresponds to w[t-2]

(('w', -1), ('w', 0)),

the bigram starting at the previous token

(This represents a sequence of consecutive words.)
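To see how the three variables fit together, here is a minimal sketch that parses lines with separator and fields and then expands templates into "name=value" attributes. It is written for these notes and is not the code in crfutils.py or template.py; `read_items` and `apply_templates` are hypothetical names, and skipping templates that reach past the sequence boundary is an assumption of this sketch.

```python
separator = " "
fields = "w pos y"
templates = (
    (("w", -2),),              # w[t-2]
    (("w", -1), ("w", 0)),     # the bigram starting at the previous token
    (("pos", 0),),             # pos[t]
)


def read_items(lines):
    """Turn each input line into a dict keyed by field name."""
    names = fields.split(" ")
    return [dict(zip(names, line.split(separator))) for line in lines]


def apply_templates(items, templates):
    """Return, for each position t, the list of attribute strings."""
    result = []
    for t in range(len(items)):
        attrs = []
        for template in templates:
            positions = [t + offset for _, offset in template]
            if any(p < 0 or p >= len(items) for p in positions):
                continue  # the template reaches outside the sequence
            name = "|".join("%s[%d]" % (field, offset) for field, offset in template)
            value = "|".join(items[p][field]
                             for (field, _), p in zip(template, positions))
            attrs.append("%s=%s" % (name, value))
        result.append(attrs)
    return result


items = read_items(["London JJ B-NP", "shares NNS I-NP"])
for item, attrs in zip(items, apply_templates(items, templates)):
    print(item["y"], attrs)
```

Adapting the extractor to another task then comes down to changing the three variables at the top, which matches the idea that chunking.py itself stays small.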

Feature extractors for other tasks

https://github.com/chokkan/crfsuite/tree/0.12/example